Probability Surveys

Vol. 4 (2007) 146-171 

ISSN: 1549-5787 

DOI: 10.1214/07-PS092I 

Notes on the occupancy problem with 
infinitely many boxes: general 
asymptotics and power 

Alexander Gnedin 

Utrecht University 
e-mail: gnedin@math.uu.nl

Ben Hansen 

University of Michigan 
e-mail: bbh@umich.edu

Jim Pitman 

University of California, Berkeley 
e-mail: pitman@stat.Berkeley.EDU

Abstract: This paper collects facts about the number of occupied boxes in 
the classical balls-in-boxes occupancy scheme with infinitely many positive 
frequencies: equivalently, about the number of species represented in sam- 
ples from populations with infinitely many species. We present moments of 
this random variable, discuss asymptotic relations among them and with re- 
lated random variables, and draw connections with regular variation, which 
appears in various manifestations. 

AMS 2000 subject classifications: Primary 60F05, 60F15; secondary 
60C05. 

Keywords and phrases: occupancy problem, regular variation, asymp- 
totics, poissonization, species sampling. 

Received January 2007. 



Contents

1 Introduction
2 Moments and poissonization
3 Some estimates of the moments
4 Laws of large numbers
5 CLT for K_n
6 The variance
7 Regularly varying frequencies
8 Two uses of inversion
9 The cumulative frequency of empty boxes
10 Pure power laws
11 Strong laws for large parts
References

*Research supported in part by N.S.F. Grant DMS-0405779 and N.I.C.H.D. Grant HD045753-01



1. Introduction 



We consider the classical multinomial occupancy scheme in which balls are
thrown independently at a fixed infinite series of boxes, with probability $p_j$
of hitting the $j$th box. The frequencies $(p_j,\ j = 1, 2, \ldots)$ are assumed
nonincreasing, strictly positive and satisfying $\sum_j p_j = 1$. As $n$ balls are
thrown, their allocation is captured by the array $X_n = (X_{n,j},\ j = 1, 2, \ldots)$,
where $X_{n,j}$ is the number of balls out of the first $n$ that fall in box $j$.

In concrete applications, instead of boxes one has types or species of sampling
units, and the sample array of types, $X_n$, is of interest for what it reveals about
the population frequencies $(p_j)$. Such species sampling problems arise in ecology,
to be sure, but also in database query optimization, where the sampling units
may be entries in columns of a database while the species consist of all distinct
values appearing in the column [10]; in literature, where the sampling units may
be words appearing in a given author's known works while the species consist
of all words known to that author [14]; in disclosure risk limitation, where the
sampling units may be people or firms listed in a microdata file, without names
or other overtly identifying information, while the types are unique combinations
of values of variables with which the people or firms might be implicitly identified
[34]; and in many other areas [9]. Models positing infinitely many boxes or
species may approximate sampling from large, finite populations, or they may be
useful as models of 'superpopulations' from which both samples and background
populations are notionally drawn.

A functional of $X_n$ which appears in many contexts is the number of nonempty
boxes

$$K_n := \#\{j : X_{n,j} > 0\}.$$

$K_n$ is sometimes regarded as a measure of diversity of the sample. More detailed
information is carried by the counts of boxes occupied by exactly $r$ balls

$$K_{n,r} := \#\{j : X_{n,j} = r\} \qquad (r = 1, 2, \ldots),$$

so that $K_n = \sum_r K_{n,r}$ and $\sum_r r K_{n,r} = n$. The combinatorial object encoded
into the array of counts $(K_{n,1}, \ldots, K_{n,n})$ is a random partition of the integer $n$; this
partition has $K_n$ parts which correspond to the positive entries of $X_n$.
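
The definitions of $K_n$ and $K_{n,r}$ can be illustrated by a small simulation. The following Python sketch is not part of the original notes: it assumes, purely for illustration, geometric frequencies $p_j = 2^{-j}$ truncated to 40 boxes, and the helper names `sample_counts` and `occupancy_summary` are hypothetical. It checks the two identities $K_n = \sum_r K_{n,r}$ and $\sum_r r K_{n,r} = n$ on one sample.

```python
import random
from collections import Counter


def sample_counts(n, probs, rng):
    """Throw n balls into boxes with the given probabilities.

    Returns a Counter mapping box index j -> X_{n,j} (only occupied boxes appear).
    """
    boxes = rng.choices(range(len(probs)), weights=probs, k=n)
    return Counter(boxes)


def occupancy_summary(counts):
    """Return K_n and a dict r -> K_{n,r} computed from the box counts."""
    k_nr = Counter(counts.values())  # r -> number of boxes holding exactly r balls
    k_n = sum(k_nr.values())         # number of occupied boxes
    return k_n, dict(k_nr)


# Illustrative assumption: geometric frequencies 2^-(j+1), j = 0..39,
# with the neglected tail mass folded into the last box so the weights sum to 1.
probs = [2.0 ** -(j + 1) for j in range(40)]
probs[-1] *= 2

rng = random.Random(0)
n = 1000
counts = sample_counts(n, probs, rng)
k_n, k_nr = occupancy_summary(counts)

print("K_n =", k_n)
print("K_{n,r} =", dict(sorted(k_nr.items())))
print("sum_r K_{n,r} == K_n:", sum(k_nr.values()) == k_n)
print("sum_r r*K_{n,r} == n:", sum(r * k for r, k in k_nr.items()) == n)
```

The array `k_nr` is exactly the random partition of $n$ described above, stored as multiplicities of part sizes.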

The variables $K_n$ and $K_{n,r}$'s have been intensively studied in the occupancy
scheme with finitely many positive frequencies, which in some models may vary
in a certain way with $n$. The most studied is, of course, the classical case of
$m$ equal frequencies (e.g. in the familiar 'birthday paradox' one is interested in
the probability of the event $K_{n,1} < n$). Kolchin, Sevast'yanov and Chistyakov
[29] identified five distinct asymptotic regimes for $m = m(n) \to \infty$ to secure
either a Poisson or a normal limit distribution for $K_n$ (more precisely, they
discussed the number of empty boxes $m - K_n$); for example, when $m \sim n$, both
the mean and the variance of $K_n$ grow approximately linearly with $n$, and the
limit distribution is normal. See [27, 26, 29] for extensions, surveys and many
references.

In contrast, the literature on the problem with infinitely many fixed
positive frequencies is scarce. For fixed frequencies $(p_j)$ the asymptotic growth
of each $X_{n,j}$ (as $n \to \infty$) is linear, and the growth of $K_n$ is always sublinear,
since the main contribution to $K_n$ comes from the boxes whose frequencies
are arbitrarily small. More delicate asymptotics of the moments of $K_n$ and $K_{n,r}$
are determined by the way the $p_j$ approach 0. The first systematic study of the
asymptotics appeared in a remarkable paper by Karlin [28], in which he proved
a central limit theorem under a condition of regular variation on the frequencies.


Recently, two new sources of interest in the infinite model have emerged. On the
one hand, it has been observed that 'power laws' for $K_n$ are quite common in
partition-valued processes of coagulation and fragmentation (see [32, 22, 24, 5]):
these fall in the range of regular variation with positive index. On the other hand,
the case of a geometric sequence $(p_j)$ (which is an instance of slow variation) was
intensively studied in connection with the analysis of algorithms [2, 30].

These notes present a kind of survey which extends and updates Karlin's
results on the infinite occupancy scheme. The results in Sections 2-6 and 10
are of a general nature, while in Sections 7-9 we work under the assumption of
regular variation. In particular, we record various guises of regular variation, and
mention some recent developments. Some of the results are new and others are
scattered in the literature and have been rediscovered several times (especially
the results in Section 10).

Notation. Throughout, $c, c_j$ denote positive constants whose values are not
important and may change from line to line. We use $f \sim g$, $f \ll g$, $f \gg g$ and
$f \asymp g$ for $f/g \to 1$, $f/g \to 0$, $f/g \to \infty$ and $c_1 < f/g < c_2$, respectively. When
either of $f$ and $g$ is a random quantity, the notation $f \sim_{\mathrm{a.s.}} g$, $f \ll_{\mathrm{a.s.}} g$ etc.
means that the asymptotic relation holds with probability one. Convergence in
distribution is denoted $\to_d$.

2. Moments and poissonization 

We start by recalling the familiar fact that $X_n = (X_{n,1}, X_{n,2}, \ldots)$ has a
multinomial distribution with parameters $(n, (p_j))$:

$$P(X_{n,j} = n_j,\ j = 1, 2, \ldots) = \frac{n!}{\prod_j n_j!} \prod_j p_j^{n_j} \qquad \Big(\sum_j n_j = n\Big).$$

From this the distribution of the partition $(K_{n,1}, \ldots, K_{n,n})$ is recovered by
summation. Specifically, for $(k_1, \ldots, k_n)$ a fixed partition of $n$

$$P(K_{n,1} = k_1, \ldots, K_{n,n} = k_n) = \frac{n!}{\prod_{i=1}^{n} (i!)^{k_i}\, k_i!} \sum_{j_1, \ldots, j_k \text{ distinct}} \ \prod_{l=1}^{k} p_{j_l}^{n_l},$$

where $n_1, \ldots, n_k$ is a sequence of length $k = \sum_i k_i$, with $k_i$ terms equal to $i$
for $i = 1, \ldots, n$. Up to the factor $\prod_i k_i!$, the infinite sum is the monomial symmetric
function in the variables $p_j$. Formulas for the distribution of $K_n$ and marginal distributions
of $K_{n,r}$'s follow by further summation over partitions of $n$. In terms of the
generating function, the probability of $K_n = m$ is equal to the coefficient at
$x^n y^m / n!$ in the series expansion of the infinite product

$$\prod_j \left(1 + y\,(e^{x p_j} - 1)\right).$$

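Formulas for the moments follow from the representation

$$K_{n,r} = \sum_j 1(X_{n,j} = r), \qquad K_n = \sum_j 1(X_{n,j} > 0), \tag{1}$$

(where $1(\cdots)$ equals 1 if $\cdots$ is true and equals 0 otherwise). Denoting

$$\Phi_n := E[K_n], \qquad \Phi_{n,r} := E[K_{n,r}],$$

we easily see that

$$\Phi_n = \sum_j \left(1 - (1 - p_j)^n\right), \qquad \Phi_{n,r} = \binom{n}{r} \sum_j p_j^r (1 - p_j)^{n-r}. \tag{2}$$

These are related by the formulas

$$\Phi_{n,r} = (-1)^{r+1} \binom{n}{r} \Delta^r \Phi_n,$$

where $\Delta^r$ is the $r$th iterate of the difference operator $\Delta\Phi_n = \Phi_n - \Phi_{n-1}$. Lengthy
but straightforward computations yield formulas for the variance

$$V_n := \mathrm{Var}[K_n] = \Phi_{2n} - \Phi_n + \sum_{i \ne j} \left[(1 - p_i - p_j)^n - (1 - p_i)^n (1 - p_j)^n\right], \tag{3}$$

$$V_{n,r} := \mathrm{Var}[K_{n,r}] = \Phi_{n,r} - \binom{n}{r}^{2} \binom{2n}{2r}^{-1} \Phi_{2n,2r} + \sum_{i \ne j} p_i^r p_j^r \left[\frac{n!}{r!\, r!\, (n-2r)!} (1 - p_i - p_j)^{n-2r} - \binom{n}{r}^{2} (1 - p_i)^{n-r} (1 - p_j)^{n-r}\right]. \tag{4}$$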

Formulas for higher moments and covariance can be derived in the same way,
but they seem to be of little practical use in the explicit form. See [29, Section
3.1] for such moment computations, which are also valid in the case of infinitely
many positive frequencies.
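
The moment formulas (2) and the difference relation between $\Phi_{n,r}$ and $\Phi_n$ are easy to check numerically. The sketch below is an illustration only: it assumes truncated geometric frequencies $(1-q)q^j$ (any finite list of frequencies would do, since the identities hold term by term), and the function names are hypothetical.

```python
from math import comb


def geometric_freqs(q=0.5, size=200):
    """Illustrative truncated geometric frequencies (1 - q) * q**j (an assumption)."""
    return [(1 - q) * q ** j for j in range(size)]


def phi_n(n, p):
    """Phi_n = E[K_n] = sum_j (1 - (1 - p_j)^n), formula (2)."""
    return sum(1 - (1 - x) ** n for x in p)


def phi_nr(n, r, p):
    """Phi_{n,r} = E[K_{n,r}] = C(n, r) * sum_j p_j^r (1 - p_j)^(n - r), formula (2)."""
    return comb(n, r) * sum(x ** r * (1 - x) ** (n - r) for x in p)


def backward_difference(n, r, p):
    """r-th iterate of (Delta Phi)_n = Phi_n - Phi_{n-1}, evaluated at n."""
    return sum((-1) ** i * comb(r, i) * phi_n(n - i, p) for i in range(r + 1))


p = geometric_freqs()
n = 30
for r in (1, 2, 3):
    direct = phi_nr(n, r, p)
    via_difference = (-1) ** (r + 1) * comb(n, r) * backward_difference(n, r, p)
    print(r, direct, via_difference)

# Consistency with K_n = sum_r K_{n,r}: the means must add up as well.
print(sum(phi_nr(n, r, p) for r in range(1, n + 1)), phi_n(n, p))
```

The two columns printed for each $r$ agree up to floating-point error, as do the two numbers on the last line.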

One major obstacle in the study of the counts $K_n$ and $K_{n,r}$'s is that the indicators
in (1) are not independent. A common recipe to circumvent this difficulty is to
first consider a closely related type of model in which the balls are thrown in
continuous time at epochs of a unit rate Poisson process $(P(t),\ t \ge 0)$, which is
independent of $(X_n,\ n = 1, 2, \ldots)$. The advantage of this randomization is that
one can exploit independence, since the balls then fall in the boxes according
to independent Poisson processes $(X_j(t),\ t \ge 0)$, at rate $p_j$ for box $j$. Once
the properties of the Poisson allocation scheme are acquired, one still needs
to translate them into fixed-$n$ results, by using a kind of depoissonization
technique.

Remark. Poissonization liberates us from the constraints of the fixed-$n$ scheme.
The Poisson model is well defined for arbitrary positive rates $p_j$ which need not
satisfy $\sum_j p_j < \infty$ or even $p_j < 1$. If $\sum_j p_j < \infty$, a reduction to the normalized
case $\sum_j p_j = 1$ is maintained by the obvious time-change $t \mapsto t/\sum_j p_j$. If
$\sum_j p_j = \infty$, $K(t)$ is infinite with probability one, though the $K_r(t)$'s can still be
finite. One can also consider two-sided infinite sequences like the geometric
$(p_j = q^j,\ j \in \mathbb{Z})$ (with $0 < q < 1$), for which the $K_r(t)$'s and the sum of variances

$$V(t) := \sum_{j=-\infty}^{\infty} \mathrm{Var}[1(X_j(t) > 0)] = \sum_{j=-\infty}^{\infty} \left(e^{-t q^j} - e^{-2 t q^j}\right)$$

are all finite.

The convention in this paper is that the quantities associated with the Poisson
allocation scheme appear in the functional notation, while the lower-index
notation is reserved for the fixed-$n$ scheme. For instance, $X_j(t) = X_{P(t),j}$,

$$K(t) := K_{P(t)} = \sum_j 1(X_j(t) > 0), \qquad K_r(t) := K_{P(t),r} = \sum_j 1(X_j(t) = r). \tag{5}$$

For the poissonized moments $\Phi(t) := E[K(t)]$, $\Phi_r(t) := E[K_r(t)]$ we have

$$\Phi(t) = \sum_j \left(1 - e^{-t p_j}\right), \qquad \Phi_r(t) = \frac{t^r}{r!} \sum_j p_j^r e^{-t p_j} \qquad (r = 1, 2, \ldots),$$

and these are related via

$$\Phi_r(t) = (-1)^{r+1} \frac{t^r}{r!}\, \Phi^{(r)}(t),$$

where $\Phi^{(r)}$ is the $r$th derivative. Formulas for the variance are simpler than
(3) and (4) because the cross-terms disappear due to independence:

$$V(t) := \mathrm{Var}[K(t)] = \Phi(2t) - \Phi(t), \qquad V_r(t) := \mathrm{Var}[K_r(t)] = \Phi_r(t) - 2^{-2r} \binom{2r}{r} \Phi_{2r}(2t). \tag{6}$$

Similarly, for integer $r \ne s$, using $K_r(t) + K_s(t) = \sum_j 1(X_j(t) \in \{r, s\})$ the
covariance is computed as

$$\mathrm{cov}(K_r(t), K_s(t)) = -2^{-r-s} \binom{r+s}{r} \Phi_{r+s}(2t). \tag{7}$$
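
Since the indicators in (5) are independent across boxes, formulas (6) and (7) can be verified directly by summing Bernoulli variances and covariances box by box. The following sketch does this for an assumed finite list of frequencies $2^{-(j+1)}$; the list and all function names are illustrative choices, not taken from the original notes.

```python
from math import comb, exp, expm1, factorial


def freqs(size=60):
    """Illustrative frequencies 2^-(j+1), j = 0..size-1 (an assumption)."""
    return [2.0 ** -(j + 1) for j in range(size)]


def phi(t, p):
    """Phi(t) = sum_j (1 - e^{-t p_j})."""
    return sum(-expm1(-t * x) for x in p)


def phi_r(t, r, p):
    """Phi_r(t) = t^r / r! * sum_j p_j^r e^{-t p_j}."""
    return t ** r / factorial(r) * sum(x ** r * exp(-t * x) for x in p)


def poisson_pmf(k, mean):
    """P(N = k) for N Poisson with the given mean."""
    return mean ** k * exp(-mean) / factorial(k)


def var_k_direct(t, p):
    """Sum over boxes of Var 1(X_j(t) > 0) = e^{-t p_j} (1 - e^{-t p_j})."""
    return sum(exp(-t * x) * -expm1(-t * x) for x in p)


def var_kr_direct(t, r, p):
    """Sum over boxes of Var 1(X_j(t) = r)."""
    return sum(poisson_pmf(r, t * x) * (1 - poisson_pmf(r, t * x)) for x in p)


def cov_direct(t, r, s, p):
    """For r != s the events {X_j = r} and {X_j = s} exclude each other,
    so each box contributes -P(X_j = r) P(X_j = s)."""
    return -sum(poisson_pmf(r, t * x) * poisson_pmf(s, t * x) for x in p)


p = freqs()
t = 50.0
print(var_k_direct(t, p), phi(2 * t, p) - phi(t, p))
for r in (1, 2, 3):
    formula = phi_r(t, r, p) - comb(2 * r, r) / 4 ** r * phi_r(2 * t, 2 * r, p)
    print(r, var_kr_direct(t, r, p), formula)
r, s = 1, 3
print(cov_direct(t, r, s, p), -comb(r + s, r) / 2 ** (r + s) * phi_r(2 * t, r + s, p))
```

Each printed pair compares a direct box-by-box computation with the closed form from (6) or (7).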

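For analytical reasons which will soon become clear it is convenient to encode the
frequencies $(p_j)$ into an infinite counting measure

$$\nu(dx) := \sum_j \delta_{p_j}(dx)$$

on $]0, 1[$, where $\delta_x$ is the Dirac mass at $x$. Equivalently,

$$\sum_j f(p_j) = \int_0^1 f(x)\,\nu(dx) \tag{8}$$

holds for arbitrary $f \ge 0$. In particular,

$$\Phi_n = \int_0^1 \left(1 - (1 - x)^n\right) \nu(dx), \tag{9}$$

$$\Phi_{n,r} = \binom{n}{r} \int_0^1 x^r (1 - x)^{n-r}\, \nu(dx) \qquad (r = 1, 2, \ldots), \tag{10}$$

$$\Phi(t) = \int_0^1 \left(1 - e^{-tx}\right) \nu(dx), \tag{11}$$

$$\Phi_r(t) = \frac{t^r}{r!} \int_0^1 x^r e^{-tx}\, \nu(dx) \qquad (r = 1, 2, \ldots). \tag{12}$$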

Remarks. The formulas for expected values remain exactly the same when
the frequencies $(p_j)$ are random, in which case the 'intensity measure' $\nu$ is
defined by taking expectation in the left-hand side of (8). See the recent work on
composition structures [2H HSl S] for more in this direction. 

The summability constraint $\sum_j p_j = 1$ translates as $\int_0^1 x\,\nu(dx) = 1$, and
implies that $\nu$ can also be interpreted as a Lévy measure of some subordinator,
which jumps by $p_j$ at rate 1 for each $j$. In this interpretation (11) has the
meaning of a Laplace exponent [20].

Both $\Phi(t)$ and $\Phi_n$ (considered as functions of a real or complex-valued
argument) are Bernstein functions which uniquely determine $\nu$, see [20]. They are
related by the poissonization identity

$$\Phi(t) = E[\Phi_{P(t)}] = \sum_{n=0}^{\infty} \frac{e^{-t} t^n}{n!}\, \Phi_n \qquad (t \ge 0).$$

3. Some estimates of the moments

Monotonicity is a key feature of $K_n$. In fact, we have $K_n \uparrow_{\mathrm{a.s.}} \infty$ (as $n \to \infty$)
and $K(t) \uparrow_{\mathrm{a.s.}} \infty$ (as $t \to \infty$), because each box is eventually discovered by a
ball.

By monotone convergence, also $\Phi_n \uparrow \infty$ and $\Phi(t) \uparrow \infty$. However, the growth
is always sublinear: $\Phi_n \ll n$ ($n \to \infty$) and $\Phi(t) \ll t$ ($t \to \infty$). Indeed, if
we ignore the first $J$ boxes, the mean number of discovered boxes among the
remaining ones is at most $n \sum_{j>J} p_j$ (respectively, at most $t \sum_{j>J} p_j$ in the
Poisson scheme). Thus $\Phi_n < J + n \sum_{j>J} p_j$ (respectively, $\Phi(t) < J + t \sum_{j>J} p_j$),
and selecting $J$ arbitrarily large we see that $\limsup \Phi_n/n = \limsup \Phi(t)/t = 0$.
Aside from these two general features, the growth properties of $K_n$ can be fairly
arbitrary.

The next lemma gives general estimates of closeness of the moments in the
fixed-$n$ scheme and the Poisson scheme.

Lemma 1. For $n \to \infty$ the following estimates hold:

$$|\Phi(n) - \Phi_n| < \frac{2}{n}\, \Phi_2(n) \to 0,$$

$$|\Phi_r(n) - \Phi_{n,r}| < \frac{c}{n} \max\{\Phi_r(n), \Phi_{r+2}(n)\} \to 0,$$

$$|V(n) - V_n| < \frac{c}{n} \max\{\Phi_1(n), \Phi_1(n)^2\} < \frac{c}{n} \max\{1, \Phi_1(n)^2\},$$

\Vr{n)-Vn,r\ < - max { ^.^ (n)^ (n)^ (n) , $^+2 (n) , $2r (2n) } 



Proof. The first two bounds follow from the elementary inequality
$0 < e^{-nx} - (1 - x)^n < n x^2 e^{-nx}$, valid for $0 < x < 1$. For instance,

$$|\Phi(n) - \Phi_n| = \int_0^1 \left(e^{-nx} - (1 - x)^n\right) \nu(dx) < n \int_0^1 x^2 e^{-nx}\, \nu(dx) = \frac{2}{n}\, \Phi_2(n) < \frac{2}{n}\, \Phi(n) \to 0.$$

The last two bounds follow from this and estimates of the cross-terms in (3)
and (4), by using the expansion

$$(a - b)^n = a^n - n a^{n-1} b + O(n^2 a^{n-2} b^2) \qquad (a > b)$$

with $a = (1 - p_i)(1 - p_j) = 1 - p_i - p_j + p_i p_j$ and $b = p_i p_j$. □

Taken together with $\Phi(t) \uparrow \infty$ the lemma implies $\Phi_n \sim \Phi(n)$.

Remark. It seems plausible that the relations $V(n) \sim V_n$ and $\Phi_r(n) \sim \Phi_{n,r}$
are true for arbitrary $(p_j)$. However, the estimates in the lemma are not strong
enough to entail such a conclusion in full generality, although it has been shown
under various circumstances (see e.g. Lemma 4 below and Section 6) and no
counterexamples are known. Hwang and Janson [25, Proposition 4.3(ii)] show
that always $V(n) \asymp V_n$. The difficulty is that $V(t)$ and $\Phi_r(t)$ may exhibit rather
irregular oscillatory behaviour. For instance $V(t)$, $V_n$, $\Phi_r(t)$ and $\Phi_{n,r}$ may
approach 0 arbitrarily closely for some $n$ or $t$ (though they cannot converge to 0,
as is seen by selecting a subsequence $n_j \asymp 1/p_j$ or $t_j \asymp 1/p_j$, respectively).

We denote

$$\bar\nu(x) := \nu[x, \infty[$$

the right tail of $\nu$. Note that there are at most $m$ frequencies not smaller than
$1/m$, hence $\bar\nu(1/m) \le m$. Moreover, we have $\bar\nu(x) \ll x^{-1}$ for $x \downarrow 0$. Indeed,
integrating by parts for $x > 0$ we obtain

$$\int_x^1 u\,\nu(du) = x\,\bar\nu(x) + \int_x^1 \bar\nu(u)\,du.$$

As $x \downarrow 0$ the integral at left increases to 1, entailing by monotonicity that the
integral at right converges, and by extension that $x\,\bar\nu(x)$ converges to a limit
also. Since $\lim_{x \downarrow 0} x\,\bar\nu(x) > 0$ would force the integral at right to diverge, $x\,\bar\nu(x) \to 0$.

4. Laws of large numbers

The mean number of occupied boxes satisfies $\Phi(t + \tau) - \Phi(t) < \Phi(\tau)$ (for $\tau, t > 0$).
One way to justify this is by noting that the mean number of distinct boxes hit
during any time interval $[t, \tau + t]$ is $\Phi(\tau)$, but some of them have been discovered
before time $t$ and do not contribute to the increment $K(t + \tau) - K(t)$. The same
follows from concavity of $\Phi(t)$, which implies

$$\frac{\Phi(t + \tau) - \Phi(t)}{\tau} < \frac{\Phi(\tau) - \Phi(0)}{\tau}.$$

Similar inequalities hold for $\Phi_n$. Using these the variance can be bounded via
expectation as

$$V(t) = \Phi(2t) - \Phi(t) < \Phi(t) = E[K(t)], \qquad V_n < \Phi_{2n} - \Phi_n < \Phi_n = E[K_n].$$

Applying Chebyshev's inequality and recalling that $\Phi_n \uparrow \infty$ and $\Phi(t) \uparrow \infty$,
the bound on the variance allows one to conclude that both $K(t)/\Phi(t)$ and
$K_n/\Phi_n$ converge to 1 in probability, which is a result due to Bahadur [3]. A
similar analysis invoking (6), (4) and Lemma 1 shows that also $K_r(t)/\Phi_r(t)$ and
$K_{n,r}/\Phi_{n,r}$ converge to 1 in probability, provided $\Phi_r(t) \to \infty$. For $K_n$ and $K(t)$
there is the following strengthening due to Karlin [28, Theorem 8].

Proposition 2. For arbitrary $(p_j)$

$$K_n \sim_{\mathrm{a.s.}} \Phi_n, \qquad K(t) \sim_{\mathrm{a.s.}} \Phi(t).$$

Proof. The function $\Phi(t)$ is continuous, increasing and satisfies $\Phi'(t) < 1$. Thus
it is possible to select an increasing sequence $(t_m,\ m = 1, 2, \ldots)$ such that
$m^2 < \Phi(t_m) < m^2 + 1$. We have then from the above estimate of the variance
$P(|K(t_m)/\Phi(t_m) - 1| > \varepsilon) < \varepsilon^{-2} m^{-2}$, and by summability of the bound
$K(t_m)/\Phi(t_m) \to_{\mathrm{a.s.}} 1$ along the subsequence. For $t_m < t < t_{m+1}$ the
monotonicity implies $K(t_m) \le K(t) \le K(t_{m+1})$ and $\Phi(t_m) \le \Phi(t) \le \Phi(t_{m+1})$. This
allows one to squeeze the ratio as

$$\frac{K(t_m)}{\Phi(t_{m+1})} \le \frac{K(t)}{\Phi(t)} \le \frac{K(t_{m+1})}{\Phi(t_m)},$$

where both sides converge to 1 almost surely, in consequence of the above and
$\Phi(t_m)/\Phi(t_{m+1}) \to 1$. The argument for $K_n$ is completely analogous. □

Instead of using the Chebyshev inequality one can exploit a finer Bernstein-type
large deviation bound for sums of independent bounded variables [16, p.
911]:

$$P(|K(t)/\Phi(t) - 1| > \varepsilon) < c\, e^{-\varepsilon' \Phi(t)},$$

where $\varepsilon'$ depends on $\varepsilon$. This allows one to choose a subsequence $(t_m)$ with
smaller gaps, so that ^{t,n+i) — <&(im) ^ m~^. 

Using monotonicity of $\sum_{r \ge s} K_{n,r}$, we obtain along the same lines for every
fixed integer $s$

$$\sum_{r \ge s} K_{n,r} \sim_{\mathrm{a.s.}} \sum_{r \ge s} \Phi_{n,r}, \qquad \sum_{r \ge s} K_r(t) \sim_{\mathrm{a.s.}} \sum_{r \ge s} \Phi_r(t) \qquad \text{and} \qquad \sum_{r \ge s} \Phi_{n,r} \sim \sum_{r \ge s} \Phi_r(n).$$

Remark. The relations $K_{n,r} \sim_{\mathrm{a.s.}} \Phi_{n,r}$, $K_r(t) \sim_{\mathrm{a.s.}} \Phi_r(t)$ may fail, simply because
$\Phi_r(t)$ need not go to $\infty$, while the counting processes have unit jumps. It is
natural to conjecture that these laws of large numbers are true under the condition
$\Phi_r(t) \to \infty$, but we do not know if this has been proved in full generality.

5. CLT for K_n

Recall the representation (5) of $K(t)$ as a sum of independent indicators. Applying
the Lindeberg-Feller condition, we see that $(K(t) - \Phi(t))/V(t)^{1/2}$ converges
to the standard normal distribution provided V(t) — > oo. The following depois- 
sonization argument leading to the CLT for Kn follows the line in Dutko [T^ . 
Monotonicity of Kn plays here a central role. 

Proposition 3. The conditions $V(t) \to \infty$ and $\mathrm{Var}[K_n] \to \infty$ are equivalent,
and if they hold, the law of $(K_n - a_n)/b_n^{1/2}$ converges to the standard normal
distribution, where $\Phi_n$ or $\Phi(n)$ can be selected for the constant $a_n$ and $V(n)$ or
$V_n$ for $b_n$.

The next lemma will imply that both choices for $b_n$ are good.

Lemma 4. The conditions $\mathrm{Var}[K_n] \to \infty$ and $V(t) \to \infty$ are equivalent, and
imply $V(n) \sim V_n$.

Proof. We first show that $\Phi_1(t)^2 \ll t\,V(t)$. Denote for shorthand

$$\phi(t) := t^{-1} \Phi_1(t) = \Phi'(t).$$

Note that $\phi'(t) < 0$. Since $\bar\nu(x) < \infty$ for $x > 0$ we have

$$\phi(t) \sim \int_0^{\varepsilon} x e^{-tx}\, \nu(dx) \qquad (t \to \infty)$$

for every $\varepsilon > 0$. Thus by Cauchy-Schwarz (applied to the measure $x\,\nu(dx)$)

$$\left(\int_0^{\varepsilon} x e^{-tx}\, \nu(dx)\right)^{2} \le \int_0^{\varepsilon} x\,\nu(dx) \int_0^{\varepsilon} x e^{-2tx}\, \nu(dx) \sim \phi(2t) \int_0^{\varepsilon} x\,\nu(dx),$$

and letting $\varepsilon \to 0$ the first integral factor vanishes, which yields $\phi(t)^2 \ll \phi(2t)$.
Using that $\phi$ is decreasing,

$$V(t) = \int_t^{2t} \phi(u)\,du > t\,\phi(2t) \gg t\,\phi(t)^2 = t^{-1} \Phi_1(t)^2,$$

as wanted. Now, if $V(t) \to \infty$ then the statement of the lemma follows from
the above and the third estimate in Lemma 1. If $V_n \to \infty$ then $\Phi_{2n} - \Phi_n \to \infty$
since the mixed term in (3) is negative, and by the first estimate in Lemma 1
also $\Phi(2t) - \Phi(t) \to \infty$. □

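The rest of the argument for $K_n$ is as follows. From $\phi(t)^2 \ll \phi(2t)$ in the
proof of Lemma 4 we get

$$\frac{t^{1/2} \phi(t)}{(\Phi(2t) - \Phi(t))^{1/2}} = \frac{t^{1/2} \phi(t)}{\left(\int_t^{2t} \phi(u)\,du\right)^{1/2}} \le \frac{\phi(t)}{\phi(2t)^{1/2}} \to 0. \tag{13}$$

This implies

$$V(t + c t^{1/2})/V(t) \to 1 \tag{14}$$

provided $V(t) \to \infty$. Indeed,

$$\frac{V(t + c t^{1/2})}{V(t)} = \frac{\Phi(2t + 2c t^{1/2}) - \Phi(t + c t^{1/2})}{\Phi(2t) - \Phi(t)} = \frac{\Phi(2t + 2c t^{1/2}) - \Phi(2t)}{\Phi(4t) - \Phi(2t)} \cdot \frac{\Phi(4t) - \Phi(2t)}{\Phi(2t) - \Phi(t)} - \frac{\Phi(t + c t^{1/2}) - \Phi(t)}{\Phi(2t) - \Phi(t)} + 1,$$

which tends to 1 by (13) and because $(\Phi(4t) - \Phi(2t))/(\Phi(2t) - \Phi(t)) < 2$ in
consequence of $\Phi''(t) < 0$. From (13) and Lemma 1, by Markov's inequality,

$$P\left(K_{n + c n^{1/2}} - K_n > \varepsilon V(n)^{1/2}\right) \le \frac{\Phi_{n + c n^{1/2}} - \Phi_n}{\varepsilon V(n)^{1/2}} = \frac{\Phi(n + c n^{1/2}) - \Phi(n) + o(1)}{\varepsilon V(n)^{1/2}} \to 0.$$

In the very same way, and also using (14),

$$P\left(K_n - K_{n - c n^{1/2}} > \varepsilon V(n)^{1/2}\right) \to 0.$$

Therefore $|K_{n \pm c n^{1/2}} - K_n|/V(n)^{1/2}$ converge to 0 in probability. Choosing $c$
sufficiently large we have for the Poisson process $n - c n^{1/2} < P(n) < n + c n^{1/2}$
and therefore $K_{n - c n^{1/2}} \le K(n) \le K_{n + c n^{1/2}}$ with probability larger than $1 - \varepsilon$. It
follows that $(K_n - K(n))/V(n)^{1/2}$ converge to 0 in probability. The CLT for
$K_n$ now follows from this and the CLT for $K(n)$. □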



Remark. Hwang and Janson [25] have shown a more delicate local CLT for
$K_n$ under $V(t) \to \infty$. If $\Phi_r(t)$ and $\mathrm{Var}[K_r(t)]$ tend to $\infty$, then a CLT holds for
$K_r(t)$, and one can naturally suspect that the same is valid for $K_{n,r}$ (this seems
not to have been discussed in the literature). Mikhailov [31] proves a CLT for
$K_{n,r}$ but assuming that $(p_j)$ vary with $n$ in a suitable way.



6. The variance 



If $V(t)$ does not go to $\infty$ then $K(t)$ need not converge in distribution at all.
Thus it is important to have criteria for $V(t) \to \infty$ and to understand other
possible modes of behaviour of the variance. For various $(p_j)$ the variance $V(t)$
and $V_n$ can go to $\infty$, converge to a finite limit, oscillate within a bounded range,
or even oscillate between 0 and $\infty$.

In this section we sketch some recent results from [8]. The next lemma relates
the variance with the mean number of singleton boxes.

Lemma 5. There exists an increasing function $T(t)$ which satisfies $T(t)/t \in
]1, 2[$ for $t > 0$ and

$$V(t) = \frac{t}{T(t)}\, \Phi_1(T(t)).$$

Proof. For $t > 0$ the equation $\Phi(2t) - \Phi(t) = t\,\Phi'(T)$ has a unique solution
$T = T(t)$, which increases because $\Phi$ is concave. □

It follows that $V(t) \to \infty$ is equivalent to $\Phi_1(t) \to \infty$. If $V(t)$ is bounded then
$\Phi_1(t)$ is bounded and $V(n) - V_n \to 0$ by Lemma 1. We record some sufficient
conditions for $V(t) \to \infty$.

Proposition 6. Each of the following conditions implies $V(t) \to \infty$:

(i) $\nabla\bar\nu(x) \to \infty$ ($x \to 0$), where $\nabla\bar\nu(x) := \bar\nu(x/2) - \bar\nu(x) = \#\{j : x/2 \le p_j < x\}$.

(ii) $\liminf_{j \to \infty} p_{j+k}/p_j > 1/2$ for every $k = 1, 2, \ldots$

(iii) $\lim_{j \to \infty} p_j / \sum_{i > j} p_i = 0$.

Proof. Sufficiency of conditions (i) and (ii) is shown by rewriting (6) in the form

$$V(t) = \Phi(2t) - \Phi(t) = t \int_0^{\infty} e^{-tx}\, \nabla\bar\nu(x)\,dx. \tag{15}$$

For (iii) one exploits Lemma 5. □

All three conditions (i), (ii) and (iii) are satisfied (hence, $V(t)$ and $V_n$ converge
to $\infty$) if $p_{j+1}/p_j \to 1$ ($j \to \infty$).

We turn next to conditions for bounded variance or converging to a finite 
limit. The case of geometric frequencies gives a clue. 

Example 7. Suppose $(p_j)$ is a geometric sequence $p_j = (1 - q) q^j$ with ratio
$0 < q < 1$. If $q = 1/2$ then $\nabla\bar\nu(x) = 1$ for $0 < x < 1$, hence from (15) $V(t) \to 1$.
More generally, for $q = 2^{-1/k}$ for some $k = 1, 2, \ldots$ we have $\nabla\bar\nu(x) = k$ ($0 <
x < 1$), hence $V(t) \to k$. For other values of $q$ the variance does not converge,
rather it has an asymptotic expansion $V(t) = \log_{1/q} 2 + g(\log_{1/q} t) + o(1)$, where
$g$ is a periodic function with mean 0 and a small amplitude [2].
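
The convergence $V(t) \to k$ for $q = 2^{-1/k}$ is easy to observe numerically from $V(t) = \sum_j (e^{-t p_j} - e^{-2 t p_j})$. The sketch below is illustrative only: it truncates the geometric sequence after 400 terms (an assumption that is harmless for the value of $t$ used), and the function name is hypothetical.

```python
from math import exp, log


def poisson_variance(t, q, size=400):
    """V(t) = sum_j (e^{-t p_j} - e^{-2 t p_j}) for p_j = (1 - q) q^j, j = 1..size."""
    total = 0.0
    for j in range(1, size + 1):
        p = (1 - q) * q ** j
        total += exp(-t * p) - exp(-2 * t * p)
    return total


t = 1e6
for k in (1, 2, 3):
    q = 2 ** (-1 / k)
    print("k =", k, " V(t) =", poisson_variance(t, q))

# For a ratio that is not of the form 2^(-1/k), compare with log_{1/q} 2.
q = 0.3
print("q = 0.3: V(t) =", poisson_variance(t, q), " log_{1/q} 2 =", log(2) / log(1 / q))
```

For $q = 2^{-1/k}$ the sum telescopes along each of the $k$ interleaved subsequences with ratio $1/2$, which is why the printed values are very close to $k$.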

Proposition 8. A finite limit $v := \lim_{n \to \infty} V_n$ exists if and only if the
frequencies satisfy

$$\lim_{j \to \infty} \frac{p_{j+k}}{p_j} = \frac{1}{2}$$

for some integer $k \ge 1$, in which case $v = k$.

It can be shown [8] that the condition of the proposition holds if and only if $(p_j)$
can be split into $k$ disjoint nonincreasing subsequences $(p_j^{(i)})$ ($i = 1, \ldots, k$) such
that for each of these subsequences $p_{j+1}^{(i)}/p_j^{(i)} \to 1/2$, hence the variance for
sampling from $(c\,p_j^{(i)})$ approaches 1.

Proposition 9. $\limsup V(t) < \infty$ if and only if there exists a positive integer
$k$ such that for all $j = 1, 2, \ldots$

$$\frac{p_{j+k}}{p_j} \le \frac{1}{2}.$$

Moreover, if for some $k \in \mathbb{N}$

$$\limsup_{j \to \infty} \frac{p_{j+k}}{p_j} \le \frac{1}{2} \tag{16}$$

then $\limsup V(t) \le k$, and this asymptotic bound is the best possible for
frequencies satisfying (16) with given $k$.

Many examples of irregular behaviour of $V(t)$ or the $\Phi_r(t)$'s can be constructed
using the following simple idea. Consider first a series of finitely many, say
$m$, boxes with the same frequency $q$. In the Poisson scheme, the variance of
the number of occupied boxes among these $m$ is $m(e^{-tq} - e^{-2tq})$, which is a
unimodal function with the initial value 0, the maximum value $m/4$ assumed
at $t = q^{-1} \log 2$, and exponential decay for larger $t$. The mean number of boxes
occupied by exactly $r$ balls, which is $m (tq)^r e^{-tq}/r!$, has similar properties. Note that
$m$ accounts for the maximum value, while varying $q$ amounts to just rescaling the time. Now,
selecting $q_1 > q_2 > \ldots$ and taking $m_i$ frequencies equal to $q_i$ ($i = 1, 2, \ldots$) we obtain
a superposition of functions of the above type, thus creating oscillations with
fairly arbitrary highs and lows. In the following examples we focus on the Poisson
scheme, hence can ignore the normalization and only require $\sum_j p_j < \infty$.

Example 10. Choosing $q_i = q^i$ with some $0 < q < 1/2$ and $m_i = i$ we
obtain a collection of frequencies $(p_j)$ for which $V(t) \to \infty$ ($t \to \infty$) but $\nabla\bar\nu(x)$
oscillates between 0 and $\infty$. The example shows that condition (i) of Proposition
6 is not necessary for $V(t) \to \infty$.

Example 11. [28, p. 384] Choosing $q_i = 2^{-2^i}$ and $m_i = i$ we obtain $(p_j)$ for
which $V(t_k) \to \infty$ along $t_k = 2^{2^k}$ but $V(t_k) \to 0$ along $t_k = 2^{2^k + k}$ ($k \to \infty$).
Thus $V(t)$ oscillates between 0 and $\infty$, and the same applies to $\Phi_1(t)$.
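
The oscillation in Example 11 is already visible for small $k$. The following sketch evaluates $V(t) = \sum_i m_i (e^{-t q_i} - e^{-2 t q_i})$ with $q_i = 2^{-2^i}$ and $m_i = i$, keeping only the first 8 classes of boxes; this truncation is an assumption that is harmless for the values of $t$ used, and the function name is hypothetical.

```python
from math import exp


def variance_example_11(t, classes=8):
    """V(t) for m_i = i boxes of frequency q_i = 2^(-2^i), i = 1..classes (Poisson scheme)."""
    total = 0.0
    for i in range(1, classes + 1):
        q = 2.0 ** -(2 ** i)
        total += i * (exp(-t * q) - exp(-2 * t * q))
    return total


for k in (3, 4, 5):
    high = variance_example_11(2.0 ** (2 ** k))      # t_k = 2^(2^k): class k is at t * q_k = 1
    low = variance_example_11(2.0 ** (2 ** k + k))   # t_k = 2^(2^k + k): between two classes
    print(k, high, low)
```

Along $t_k = 2^{2^k}$ the class with frequency $q_k$ alone contributes $k(e^{-1} - e^{-2})$, while along $t_k = 2^{2^k + k}$ every class is either almost surely occupied or almost surely empty, so the variance is small.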

Example 12. [gj Choosing qi = 2^^'^^ and rrii — 2^' we obtain (pj) for which 
V(t) — > oo and ^i{t) — > oo, while lim inf <i>2 (i) — and limsup<l>2(t) = oo. Thus 
we have here a curious pathology, when the mean number of singleton boxes 
goes to oo, but the mean number of doubletons does not. 



Remark. The last example disproves the assertion of [28, Lemma 1] that if the
convergence radius of the series $\sum_j p_j u^j$ exceeds 1, then $\bar\nu(x)$ is slowly varying
for $x \to 0$. Recall that slow variation means $\lim_{x \to 0} \bar\nu(cx)/\bar\nu(x) = 1$ for every
$c > 0$. In the last example the convergence radius of $\sum_j p_j u^j$ equals 2, because
$(p_j)^{1/j}$ oscillates between 1/4 and 1/2, as is readily checked. The multiplicity of
each frequency qi = 2~^ is equal to the number of terms larger than this value 
(fc = 1, 2, . . .), which together with c2~^ < 2~^ (for fixed c > 1 and k = k{c) 
large enough) implies that 



fe— too vi^l ^ j Z 



hence slow variation fails. 



7. Regularly varying frequencies 

We shall establish equivalent forms of regular variation in the occupancy
problem, in particular we translate them in terms of the growth of mean values of
$K_n$ and $K_{n,r}$'s.

Following [28] we say that the frequencies $(p_j)$ are regularly varying if

$$\bar\nu(x) \sim \ell(1/x)\, x^{-\alpha} \qquad (x \downarrow 0) \tag{17}$$

for some $0 \le \alpha \le 1$ and a function $\ell$ slowly varying at $\infty$, i.e. satisfying
$\ell(cy)/\ell(y) \to 1$ as $y \to \infty$, for every $c > 0$. The case $\alpha = 0$ corresponds to
slow variation, while in the case $\alpha = 1$ we shall speak of rapid variation. In the
case $\alpha = 1$ the summability of frequencies forces the function $\ell(1/x)$ to approach
0 (as $x \downarrow 0$) sufficiently fast.

Define for $r = 1, 2, \ldots$ the measures

$$\nu_r(dx) := x^r \nu(dx) = \sum_j p_j^r\, \delta_{p_j}(dx),$$

so $\nu_r[0, x] = \sum_j p_j^r\, 1(p_j \le x)$. The measure $\nu_1$ is the distribution of the frequency

of the first discovered box, also called the structural distribution or the law of 
the tagged fragment, especially when the frequencies are random ^ |33] . 

Proposition 13. The relation (17) with $0 < \alpha < 1$ is equivalent to

$$\nu_1[0, x] \sim \frac{\alpha}{1 - \alpha}\, x^{1-\alpha} \ell(1/x) \qquad (x \downarrow 0), \tag{18}$$

and for $r \ge 1$ it implies

$$\nu_r[0, x] \sim \frac{\alpha}{r - \alpha}\, x^{r-\alpha} \ell(1/x) \qquad (x \downarrow 0). \tag{19}$$

Proof. Integration by parts yields

$$\nu_1[0, x] = \int_0^x u\,\nu(du) = -x\,\bar\nu(x) + \int_0^x \bar\nu(u)\,du. \tag{20}$$

If (17) holds then, by Karamata's theorem [15, Theorem 1, Section 9, Ch. 8], the
integral term is asymptotic to $(1 - \alpha)^{-1} x^{1-\alpha} \ell(1/x)$, which readily yields (18).
To show the converse implication we use

$$\bar\nu(x) = \int_x^1 u^{-1}\, \nu_1(du) = c - x^{-1} \nu_1[0, x] + \int_x^1 u^{-2}\, \nu_1[0, u]\,du, \tag{21}$$

evaluate the integral by Karamata's theorem and note that the constant term
$c$ is dominated.

Similarly, applying Karamata's theorem to the integral term in

$$\nu_r[0, x] = \int_0^x u^r\, \nu(du) = -x^r \bar\nu(x) + r \int_0^x u^{r-1} \bar\nu(u)\,du,$$

we conclude (19). □



The cases of slow and rapid variation need special treatment.

Proposition 14. For $\ell$ slowly varying the relation

$$\bar\nu(x) \sim x^{-1} \ell(1/x) \qquad (x \downarrow 0) \tag{22}$$

implies

$$\nu_1[0, x] \sim \ell_1(1/x) \qquad (x \downarrow 0), \tag{23}$$

with $\ell_1$ another function of slow variation defined for $y > 1$ by

$$\ell_1(y) = \int_y^{\infty} u^{-1} \ell(u)\,du. \tag{24}$$

For $r > 1$ the relation (22) also implies

$$\nu_r[0, x] \sim \frac{1}{r - 1}\, x^{r-1} \ell(1/x) \qquad (x \downarrow 0). \tag{25}$$

In general, the relation (23) with some slowly varying $\ell_1$ only implies

$$\bar\nu(x) \ll x^{-1} \ell_1(1/x) \qquad (x \downarrow 0), \tag{26}$$

and does not imply the regular variation of $\bar\nu(x)$; however if the regular variation
holds then $\bar\nu(x)$ fulfills (22) with $\ell$ satisfying (24).

Proof. Suppose (22) is true. The integral (24) converges due to $\sum_j p_j =
\int_0^{\infty} x\,\nu(dx) < \infty$, and $\ell_1 \gg \ell$ since the integral of $u^{-1}$ diverges. The relation
(23) follows from (20) by noting that the second term in the right side of (20)
dominates the first. Asymptotics (25) follow by Karamata's theorem.

Conversely, (23) entails (26) by virtue of (21). If $\bar\nu(x)$ is regularly varying
then a direct argument shows that (22) and (24) must hold to match with (23).
□



Proposition 15. For $\ell_0$ slowly varying the relation

$$\nu_1[0, x] \sim x\, \ell_0(1/x) \qquad (x \downarrow 0) \tag{27}$$

implies

$$\bar\nu(x) \sim \ell(1/x) \qquad (x \downarrow 0), \tag{28}$$

with another slowly varying $\ell \gg \ell_0$ defined for $y > 1$ by

$$\ell(y) = \int_1^y u^{-1} \ell_0(u)\,du, \tag{29}$$

and also implies for $r \ge 1$

$$\nu_r[0, x] \sim \frac{1}{r}\, x^r \ell_0(1/x) \qquad (x \downarrow 0). \tag{30}$$

In general, the relation (28) with some $\ell$ only implies

$$\nu_1[0, x] \ll x\, \ell(1/x) \qquad (x \downarrow 0)$$

and does not imply the regular variation of $\nu_1[0, x]$; but if the regular variation
holds, then (27) is satisfied with slowly varying $\ell_0$ related to $\ell$ via (29).

Proof. The line of argument repeats the one in the previous proposition. Formula
(30) is obtained by evaluating the terms in

$$\nu_r[0, x] = \int_0^x u^{r-1}\, \nu_1(du) = x^{r-1} \nu_1[0, x] - (r - 1) \int_0^x u^{r-2}\, \nu_1[0, u]\,du.$$

□

The last proposition shows that (28) (i.e. (17) with $\alpha = 0$) is not strong
enough to control the $\nu_r$'s, but a slightly stronger assumption (27) is enough for
that. The case of geometric frequencies demonstrates that (28) is indeed too
weak.



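Example 16. For $p_k = (1 - q) q^{k-1}$, $k = 1, 2, \ldots$ ($0 < q < 1$) and $\nu = \sum_k \delta_{p_k}$
we have

$$\bar\nu(x) = \left\lfloor \log_q \frac{x}{1 - q} \right\rfloor + 1 \sim |\log_q(1/x)| \qquad (x \downarrow 0)$$

slowly varying, but

$$\nu_1[0, x] = q^{\lceil \log_q (x/(1 - q)) \rceil}$$

is not regularly varying, since $x^{-1} \nu_1[0, x]$ oscillates between $(1 - q)^{-1}$ and $q (1 - q)^{-1}$.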
We translate the above in terms of the mean values. 

Proposition 17. For $0 < \alpha < 1$, condition (17) is equivalent to each of the
following two relations

$$\Phi(t) \sim \Gamma(1 - \alpha)\, t^{\alpha} \ell(t) \qquad (t \to \infty), \tag{31}$$

and

$$\Phi_1(t) \sim \alpha\, \Gamma(1 - \alpha)\, t^{\alpha} \ell(t) \qquad (t \to \infty), \tag{32}$$

and for $r \ge 1$ it implies

$$\Phi_r(t) \sim \frac{\alpha\, \Gamma(r - \alpha)}{r!}\, t^{\alpha} \ell(t) \qquad (t \to \infty). \tag{33}$$

Proof. Writing

$$\Phi(t) = t \int_0^{\infty} e^{-tx}\, \bar\nu(x)\,dx,$$

we see that the equivalence of (31) and (17) follows by the Tauberian theorem
for monotone densities [15, Theorem 4, Section 5, Ch. 13]. The equivalence of
(32) and (18) follows from another form of the Tauberian theorem [15, Theorem
2, Section 5, Ch. 13]. In the same way, (33) follows from (19). □
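
Relations (31) and (32) can be observed numerically on a pure power law. The sketch below assumes, for illustration only, the frequencies $p_j = C j^{-2}$ with $C = 6/\pi^2$, for which $\bar\nu(x) = \lfloor (C/x)^{1/2} \rfloor$, so that (17) holds with $\alpha = 1/2$ and constant $\ell = C^{1/2}$. The series is truncated after 200000 terms and the function names are hypothetical.

```python
from math import exp, expm1, gamma, pi, sqrt

ALPHA = 0.5
C = 6 / pi ** 2  # normalizes sum_j j^-2 to 1


def phi(t, terms=200_000):
    """Phi(t) = sum_j (1 - e^{-t p_j}) for p_j = C / j^2 (truncated series)."""
    return sum(-expm1(-t * C / (j * j)) for j in range(1, terms + 1))


def phi_1(t, terms=200_000):
    """Phi_1(t) = t * sum_j p_j e^{-t p_j} for p_j = C / j^2 (truncated series)."""
    return sum(t * C / (j * j) * exp(-t * C / (j * j)) for j in range(1, terms + 1))


t = 1e4
scale = sqrt(C * t)                           # t^alpha * l(t) with l = C^(1/2)
ratio_31 = phi(t) / (gamma(1 - ALPHA) * scale)
ratio_32 = phi_1(t) / (ALPHA * gamma(1 - ALPHA) * scale)
print("Phi(t) / (Gamma(1 - alpha) t^alpha l(t)) =", ratio_31)
print("Phi_1(t) / (alpha Gamma(1 - alpha) t^alpha l(t)) =", ratio_32)
```

Both ratios are close to 1 at $t = 10^4$; the small deviation reflects lower-order terms and the truncation of the series.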



In the case of rapid variation we have:

Proposition 18. The relation (22) implies

$$\Phi(t) \sim \Phi_1(t) \sim t\, \ell_1(t) \qquad (t \to \infty), \tag{34}$$

with $\ell_1$ as in (24), and for $r > 1$ it implies

$$\Phi_r(t) \sim \frac{1}{r(r - 1)}\, t\, \ell(t) \qquad (t \to \infty). \tag{35}$$

Also, (23) is equivalent to $\Phi_1(t) \sim t\, \ell_1(t)$ ($t \to \infty$).

Proof. Use Tauberian arguments and Proposition 14. □

In the case of slow variation we have:

Proposition 19. The relation (27) implies $\Phi(t) \sim \ell(t)$ ($t \to \infty$), and for
$r \ge 1$

$$\Phi_r(t) \sim \frac{1}{r}\, \ell_0(t) \qquad (t \to \infty), \tag{36}$$

where $\ell$ and $\ell_0$ are related as in (29). Also, (27) is equivalent to the $r = 1$
instance of (36). The relation (28) is equivalent to $\Phi(t) \sim \ell(t)$ ($t \to \infty$) and
entails $\Phi_r(t) \ll \Phi(t)$ ($t \to \infty$) for every $r \ge 1$.

Proof. Use Tauberian arguments and Proposition 15. □

We summarize the above relations for $\Phi(t)$ and $\Phi_r(t)$'s (the relations for $\Phi_n$ and $\Phi_{n,r}$'s are completely analogous):

Corollary 20. In the case $0 < \alpha < 1$, for $r \ge 1$ all $\Phi_r(t)$ are of the same order of growth as $\Phi(t)$, and the ratios $\Phi_r(t)/\Phi(t)$ converge to $(-1)^{r+1}\binom{\alpha}{r}$ (these numbers comprise a probability distribution).

In the case of rapid variation the $\Phi_r(t)$'s are of the same order for $r > 1$ but $\Phi(t) \sim \Phi_1(t) \gg \Phi_r(t)$ for $r > 1$, that is, most of the occupied boxes are singletons.

In the case of slow variation under condition (27) we have $r\,\Phi_r(t) \sim \Phi_1(t) \ll \Phi(t)$, meaning that all $\Phi_r(t)$'s are again of the same order but each of them is much smaller than $\Phi(t)$.
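The limiting proportions in the case $0 < \alpha < 1$ are easy to tabulate. The following sketch (the function name `karlin_weights` is introduced here for illustration) computes $w_r = (-1)^{r+1}\binom{\alpha}{r}$ through the recursion $w_1 = \alpha$, $w_{r+1} = w_r\,(r-\alpha)/(r+1)$, which follows from the definition of the binomial coefficient.

```python
import math


def karlin_weights(alpha, r_max):
    """w_r = (-1)**(r+1) * binom(alpha, r) for r = 1..r_max."""
    weights, w = [], alpha
    for r in range(1, r_max + 1):
        weights.append(w)
        w *= (r - alpha) / (r + 1)     # ratio binom(alpha, r+1) / binom(alpha, r), sign flipped
    return weights
```

The weights are positive, agree with $\alpha\Gamma(r-\alpha)/(r!\,\Gamma(1-\alpha))$, and their partial sums approach 1 slowly, at a rate of order $r^{-\alpha}$.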



Remark. The results about mean values are true also when $(p_j)$ are random, with $\nu$ understood as the intensity measure of the point process $\sum_j \delta_{p_j}$, and $\nu_1$ being the structural distribution. For instance, when $(p_j)$ are Poisson-Dirichlet($\theta$) frequencies we have $\nu_1(dx) = \theta(1-x)^{\theta-1}\,dx$, so we are in the case of slow variation (27) with $\nu_1[0,x] \sim \theta x$, hence $\Phi_r(t)/\Phi_1(t) \to 1/r$, meaning that in the long run the mean number of balls in all $r$-ton boxes is approximately the same, for each $r$. The last fact has been long known, especially for $\theta = 1$ in the context of random permutations which can be associated with a random sample from Poisson-Dirichlet(1), see [1].
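For $\theta = 1$ the permutation version of this fact can be verified exactly for small $n$: in a uniform random permutation of $n$ points the expected number of cycles of length $r$ is $1/r$ for every $r \le n$, so the expected number of points lying in $r$-cycles is 1 for each $r$. The brute-force enumeration below is only a sketch written for this note; the function name is hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def mean_cycle_counts(n):
    """Exact expected number of cycles of each length in a uniform permutation of n points."""
    totals = [0] * (n + 1)
    count = 0
    for perm in permutations(range(n)):
        count += 1
        seen = [False] * n
        for start in range(n):
            if seen[start]:
                continue
            length, i = 0, start
            while not seen[i]:        # follow the cycle through `start`
                seen[i] = True
                i = perm[i]
                length += 1
            totals[length] += 1
    return {r: Fraction(totals[r], count) for r in range(1, n + 1)}
```

The enumeration visits all $n!$ permutations, so it is meant for small $n$ only.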

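We conclude as in [28]:

Corollary 21. For $0 < \alpha < 1$ the condition of regular variation (17) is equivalent to any of the following conditions

$$K_n \sim_{a.s.} \Gamma(1-\alpha)\,n^{\alpha}\ell(n) \quad (n \to \infty), \qquad K(t) \sim_{a.s.} \Gamma(1-\alpha)\,t^{\alpha}\ell(t) \quad (t \to \infty),$$

and it implies for $r = 1, 2, \dots$

$$K_{n,r} \sim_{a.s.} \frac{\alpha\Gamma(r-\alpha)}{r!}\,n^{\alpha}\ell(n) \quad (n \to \infty), \qquad K_r(t) \sim_{a.s.} \frac{\alpha\Gamma(r-\alpha)}{r!}\,t^{\alpha}\ell(t) \quad (t \to \infty).$$

Proof. Combine Proposition 2, the remarks after it, and Proposition 17. □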

We stress that in this situation the strong laws for the $K_r(t)$'s follow from the strong laws for the increasing processes $\sum_{s \ge r} K_s(t)$ and the fact that $\sum_{s \ge r}\Phi_s(t) \asymp \Phi(t) \asymp \Phi_r(t)$.
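The strong laws of Corollary 21 can be illustrated by a small simulation. The sketch below makes several assumptions not in the paper: it uses $\alpha = 1/2$ with $p_j \propto 1/j^2$ (so $\ell \equiv C^{\alpha}$, $C = 6/\pi^2$), truncates the sequence at a finite number of boxes (the discarded mass is of order `C / boxes`, and `random.choices` renormalizes the rest), and fixes a random seed for reproducibility.

```python
import math
import random
from collections import Counter
from itertools import accumulate

ALPHA = 0.5
C = 6 / math.pi ** 2


def simulate(n, boxes=200_000, seed=0):
    """Throw n balls at boxes 1..boxes with p_j proportional to 1/j**2; return box -> count."""
    rng = random.Random(seed)
    cum = list(accumulate(C / (j * j) for j in range(1, boxes + 1)))
    return Counter(rng.choices(range(1, boxes + 1), cum_weights=cum, k=n))


def occupancy(counts, r=None):
    """K_n when r is None, otherwise K_{n,r}, the number of boxes holding exactly r balls."""
    if r is None:
        return len(counts)
    return sum(1 for c in counts.values() if c == r)


def predicted(n, r=None):
    """Asymptotic value from Corollary 21 with l(n) = C**ALPHA."""
    scale = (C * n) ** ALPHA
    if r is None:
        return math.gamma(1 - ALPHA) * scale
    return ALPHA * math.gamma(r - ALPHA) / math.factorial(r) * scale
```

With `counts = simulate(20000)`, the values `occupancy(counts)` and `occupancy(counts, 1)` should fall within a few standard deviations of `predicted(20000)` and `predicted(20000, 1)`; the fluctuations are of order $n^{\alpha/2}$.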



Remark. Characterizations of regular variation through behaviour of ratios of integrals with distinct kernels are known as Mercerian Tauberian theorems [7]. For instance, $\Phi_1(t)/\Phi(t) \to \alpha$ for some constant $0 < \alpha < 1$ implies (17).

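When (17) holds with $0 < \alpha < 1$ the covariance matrix (7), normalized by $n^{\alpha}\ell(n)$, becomes

$$\sigma_{r,s} = -\frac{\alpha\Gamma(r+s-\alpha)}{r!\,s!}\,2^{-(r+s-\alpha)} \quad (r \ne s), \qquad \sigma_{r,r} = -\frac{\alpha\Gamma(2r-\alpha)}{(r!)^2}\,2^{-(2r-\alpha)} + \frac{\alpha\Gamma(r-\alpha)}{r!}.$$

Using arguments similar to those in Section 5, Karlin [28, Theorem 5] showed that the array

$$\left(\frac{K_{n,r} - \mathbb{E}[K_{n,r}]}{n^{\alpha/2}\,\ell^{1/2}(n)},\ r = 1, 2, \dots\right)$$

converges in distribution to a multivariate Gaussian array with zero mean and this covariance matrix. A similar result [28, Theorem 5'] is valid also in the rapid variation case $\alpha = 1$, but $K_{n,1}$ requires a scaling different from the scaling of the other $K_{n,r}$'s with $r > 1$, since $\mathrm{Var}[K_{n,1}] \sim n\,\ell_1(n)$ is of larger order than $\mathrm{Var}[K_{n,r}] \sim c_r\,n\,\ell(n)$ for $r > 1$.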



8. Two uses of inversion 



We recall further facts about regular variation. For a function $h : \mathbb{R}_+ \to \mathbb{R}_+$ regularly varying at infinity with index $\alpha > 0$ there exists an asymptotic inverse function $g$ which is regularly varying with index $1/\alpha$ and satisfies $h(g(y)) \sim g(h(y)) \sim y$ $(y \to \infty)$. The function $g$ is unique up to asymptotic equivalence, see [7, Proposition 1.5.12].

For $\ell$ a function of slow variation at infinity, its de Bruijn conjugate $\ell^{\#}$ is another function of slow variation satisfying

$$\ell(y)\,\ell^{\#}(y\,\ell(y)) \to 1, \qquad \ell^{\#}(y)\,\ell(y\,\ell^{\#}(y)) \to 1 \qquad (y \to \infty). \tag{38}$$

The de Bruijn conjugate exists and is uniquely determined up to asymptotic equivalence, see [7, Proposition 1.5.13]. For instance $(\log y)^{\#} \sim (\log y)^{-1}$. The following inversion formula is adapted from [7, Proposition 1.5.15].

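Lemma 22. Let $h$ and $g$ be asymptotic inverses of one another, then for $\alpha > 0$ and $\ell$ slowly varying

$$h(y) \sim y^{\alpha}\ell(y) \quad (y \to \infty) \iff g(y) \sim y^{1/\alpha}\{\ell^{1/\alpha}(y^{1/\alpha})\}^{\#} \quad (y \to \infty). \tag{39}$$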
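A worked instance of (39), offered as an illustration under assumptions made here: take $h(y) = y^{\alpha}\log y$, so $\ell = \log$. Since $\{(\log y)^{1/\alpha}\}^{\#} \sim (\log y)^{-1/\alpha}$, formula (39) suggests the candidate inverse $g(y) = (\alpha y/\log y)^{1/\alpha}$. The sketch below evaluates $h(g(y))/y$ on the logarithmic scale, to avoid overflow for very large $y$; the function name is hypothetical.

```python
import math


def inversion_ratio(log_y, alpha):
    """h(g(y)) / y for h(y) = y**alpha * log(y) and g(y) = (alpha * y / log(y))**(1/alpha).

    The argument is log(y); all intermediate quantities are kept on the log scale.
    """
    log_g = (log_y + math.log(alpha) - math.log(log_y)) / alpha
    log_h_of_g = alpha * log_g + math.log(log_g)
    return math.exp(log_h_of_g - log_y)
```

The ratio equals $1 + (\log\alpha - \log\log y)/\log y$, so it tends to 1, but only at a logarithmic rate, which is typical for relations involving slowly varying factors.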

Let $\ell^*$ be the reciprocal of the function appearing in the inversion formula (39):

$$\ell^*(y) := \frac{1}{\{\ell^{1/\alpha}(y^{1/\alpha})\}^{\#}}.$$

Keep in mind that $\ell^*$ depends on $\alpha$. This function allows one to formulate the property of regular variation directly in terms of individual frequencies.

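Proposition 23. The condition (17) with $0 < \alpha < 1$ is equivalent to

$$p_j \sim \ell^*(j)\,j^{-1/\alpha} \qquad (j \to \infty), \tag{40}$$

and it implies

$$\sum_{j>k} p_j \sim \frac{\alpha}{1-\alpha}\,\ell^*(k)\,k^{1-1/\alpha} \qquad (k \to \infty). \tag{41}$$

Proof. Consider

$$h(y) = \vec\nu(1/y) \quad \text{and} \quad g(y) = \frac{1}{p_{\lceil y \rceil}}.$$

In view of

$$\vec\nu(1/y) = \max\Big\{j : p_j \ge \frac{1}{y}\Big\} = \sup\Big\{z : \frac{1}{p_{\lceil z \rceil}} \le y\Big\}$$

the function $h$ is the generalized inverse of $g$, and because $y \le h(g(y))$ and $g(h(y)) \le y$, these functions are also asymptotic inverses of one another. By Lemma 22 the relation $h(y) \sim y^{\alpha}\ell(y)$ is equivalent to $g(y) \sim y^{1/\alpha}\{\ell^{1/\alpha}(y^{1/\alpha})\}^{\#}$, which is the same as (40). □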
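The constant $\alpha/(1-\alpha)$ in (41) can be checked on a pure power sequence. The sketch assumes $\alpha = 1/2$ and $p_j = c/j^2$ with $c = 6/\pi^2$, in which case (40) holds with $\ell^* \equiv c$ and (41) predicts a tail sum close to $c/k$. The helper names are introduced here for illustration only.

```python
import math

C = 6 / math.pi ** 2     # normalizing constant of p_j = C / j**2 (assumed example)


def tail_sum(k):
    """sum_{j > k} p_j, using the closed form sum_{j >= 1} 1/j**2 = pi**2 / 6."""
    head = sum(1.0 / (j * j) for j in range(1, k + 1))
    return C * (math.pi ** 2 / 6 - head)


def predicted_tail(k, alpha=0.5):
    """Right-hand side of (41) with l*(k) = C; meaningful here for alpha = 1/2 only."""
    return alpha / (1 - alpha) * C * k ** (1 - 1 / alpha)
```

For this sequence the ratio `tail_sum(k) / predicted_tail(k)` behaves like $1 - 1/(2k)$.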



For the fixed-$n$ allocation scheme define $N_k := \min\{n : K_n = k\}$ to be the times when new boxes are discovered. The analogous poissonized quantity is $T_k := \min\{t : K(t) = k\}$. By the definition

$$K_{N_k} = K(T_k) = k \qquad (k = 1, 2, \dots).$$

The next proposition gives the relations inverse to that.

Proposition 24. Under (17) with $0 < \alpha < 1$ we have

$$N_k \sim_{a.s.} T_k \sim_{a.s.} \left(\frac{k}{\Gamma(1-\alpha)}\right)^{1/\alpha}\frac{1}{\ell^*(k)} \qquad (k \to \infty).$$

Moreover, the following asymptotic relations hold:

$$n_k \sim \left(\frac{k}{\Gamma(1-\alpha)}\right)^{1/\alpha}\frac{1}{\ell^*(k)} \implies K_{n_k} \sim_{a.s.} K(n_k) \sim_{a.s.} k \qquad (k \to \infty), \tag{42}$$

$$k_n \sim \Gamma(1-\alpha)\,n^{\alpha}\ell(n) \implies N_{k_n} \sim_{a.s.} T_{k_n} \sim_{a.s.} n \qquad (n \to \infty). \tag{43}$$

Proof. Immediate from the uniqueness of the asymptotic inverse and de Bruijn conjugate, and Lemma 22. □

9. The cumulative frequency of empty boxes 

Functionals $K_n$, $K_{n,r}$ are of the form

$$\sum_j \psi_j(X_{n,j}),$$

called 'separable statistics' by some authors [26, 17]. Here, the $\psi_j$'s are some functions for which the sum is well defined. In this section we discuss one more instance of this kind,

$$S_n := \sum_j p_j\,1(X_{n,j} = 0), \qquad S(t) := \sum_j p_j\,1(X_j(t) = 0),$$

under the regular variation assumption (17) with $0 < \alpha < 1$. These have the meaning of the cumulative frequency of yet undiscovered boxes.

Defining $(\tilde p_k)$ to be a random arrangement of frequencies in the order as the boxes are discovered, we can also write

$$S_n = \sum_{k > K_n} \tilde p_k, \qquad S(t) = \sum_{k > K(t)} \tilde p_k, \qquad R_k := \sum_{j > k} \tilde p_j.$$

The sequence $(\tilde p_k)$ is called a size-biased permutation of the frequencies $(p_j)$, see [33]. By the formulas for $\Phi_{n,1}$ and $\Phi_1(t)$,

$$\mathbb{E}[S_n] = (n+1)^{-1}\Phi_{n+1,1}, \qquad \mathbb{E}[S(t)] = t^{-1}\Phi_1(t).$$

These mean values control the geometric number of balls (respectively, the exponential time) needed to discover yet another box after some time. Clearly,

$$S_{N_k} = S(T_k) = R_k, \qquad R_{K_n} = S_n, \qquad R_{K(t)} = S(t). \tag{44}$$

Proposition 25. Under assumption (17) with $0 < \alpha < 1$

$$S_n \sim_{a.s.} S(n) \sim_{a.s.} \alpha\Gamma(1-\alpha)\,n^{\alpha-1}\ell(n) \qquad (n \to \infty), \tag{45}$$

$$R_k \sim_{a.s.} \alpha\{\Gamma(1-\alpha)\}^{1/\alpha}\,k^{1-1/\alpha}\,\ell^*(k) \qquad (k \to \infty). \tag{46}$$

Proof. From $S(t) = \sum_j 1(X_j(t) = 0)\,p_j$, using the same asymptotic evaluations as for (33), we compute the variance as

$$\mathrm{Var}[S(t)] = \sum_j p_j^2\,(e^{-tp_j} - e^{-2tp_j}) = \int_0^\infty x^2\,(e^{-tx} - e^{-2tx})\,\nu(dx) \sim \alpha\Gamma(2-\alpha)(1 - 2^{\alpha-2})\,t^{\alpha-2}\ell(t).$$

Thus we have

$$\mathbb{E}[S(t)] \sim c_1\,t^{\alpha-1}\ell(t), \qquad \mathrm{Var}[S(t)] \sim c_2\,t^{\alpha-2}\ell(t),$$

with some positive constants $c_1$, $c_2$. Using Chebyshev's inequality and the Borel-Cantelli lemma it is not hard to show that the convergence $S(t_m)/\mathbb{E}[S(t_m)] \to_{a.s.} 1$ is secured along a sequence $t_m \sim m^{2/\alpha}$. But then $S(t)/\mathbb{E}[S(t)] \to_{a.s.} 1$ follows for $t \to \infty$ by the usual sandwich argument, since $S(t)$ and $\mathbb{E}[S(t)]$ are nonincreasing and $t_m/t_{m+1} \to 1$ as $m \to \infty$. The rest of (45) follows by checking that with probability one $S(n(1+\epsilon)) \le S_n \le S(n(1-\epsilon))$ holds for all sufficiently large $n$.

To pass from (45) to (46) we first note that for $n_k$ as in (42)

$$S_{n_k} \sim_{a.s.} \alpha\{\Gamma(1-\alpha)\}^{1/\alpha}\,k^{1-1/\alpha}\,\ell^*(k) \qquad (k \to \infty), \tag{47}$$

as is easily seen from

$$\{\ell^{1/\alpha}(y)\}^{\#}\,\ell^{1/\alpha}\big(y\,\{\ell^{1/\alpha}(y)\}^{\#}\big) \to 1,$$

which in turn is the second identity in (38). By Proposition 24 the bounds $n_{k(1-\epsilon)} < N_k < n_{k(1+\epsilon)}$ hold for sufficiently large $k$ with probability one. Finally, the first identity in (44) and monotonicity imply that for large $k$

$$S_{n_{k(1+\epsilon)}} \le R_k \le S_{n_{k(1-\epsilon)}},$$

and now (46) follows from (47) by letting $\epsilon \to 0$. □
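The first two moments of $S(t)$ used in the proof can be checked numerically. The sketch below again assumes $\alpha = 1/2$ and $p_j = C/j^2$ with $C = 6/\pi^2$ (so $\ell \equiv C^{\alpha}$), truncates the sums at a finite number of boxes, and compares them with $\alpha\Gamma(1-\alpha)t^{\alpha-1}\ell(t)$ and $\alpha\Gamma(2-\alpha)(1-2^{\alpha-2})t^{\alpha-2}\ell(t)$. All names are illustrative.

```python
import math

ALPHA = 0.5
C = 6 / math.pi ** 2


def mean_var_S(t, terms=100_000):
    """Truncated E[S(t)] = sum p_j e^{-t p_j} and Var[S(t)] = sum p_j^2 (e^{-t p_j} - e^{-2 t p_j})."""
    mean = var = 0.0
    for j in range(1, terms + 1):
        p = C / (j * j)
        e = math.exp(-t * p)
        mean += p * e
        var += p * p * (e - e * e)
    return mean, var


def predicted_mean_var(t):
    """Asymptotic mean and variance of S(t) for the assumed frequencies."""
    slow = C ** ALPHA
    mean = ALPHA * math.gamma(1 - ALPHA) * t ** (ALPHA - 1) * slow
    var = ALPHA * math.gamma(2 - ALPHA) * (1 - 2 ** (ALPHA - 2)) * t ** (ALPHA - 2) * slow
    return mean, var
```

The ratio $\mathrm{Var}[S(t)]/\mathbb{E}[S(t)]^2$ is then of order $t^{-\alpha}$, which is what makes the Chebyshev bound summable along $t_m \sim m^{2/\alpha}$.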



Remark. Comparing (46) with (41) we see that

$$R_k \Big/ \sum_{j>k} p_j \to \Gamma(2-\alpha)\{\Gamma(1-\alpha)\}^{1/\alpha - 1} > 1,$$

which supports the intuition that the ranked arrangement of frequencies $(p_j)$ decays faster than the size-biased permutation $(\tilde p_j)$. However, despite the fact that the tail sums of the sequences $(p_j)$ and $(\tilde p_j)$ decay with the same order, the analogue of (40) for $(\tilde p_j)$ does not hold. Indeed, by regular variation we have $p_i/p_{i+1} \to 1$ $(i \to \infty)$. Pick $\epsilon$ small, and let $p_i, p_{i+1}$ be the two largest frequencies smaller than $\epsilon$, and such that $p_i/p_{i+1}$ is close to 1. In the Poisson setup, let $a$ be the time needed to discover one of the two boxes with these frequencies $p_i, p_{i+1}$, and $b$ be the time to discover both of them. These can be expressed through independent exponential variables, showing that the ratio $a/b$ assumes values smaller than, say, 1/2 with a probability bounded away from zero. But this and the asymptotics of $K(t)$ readily imply that, with probability bounded away from 0, the positions of $p_i$ and $p_{i+1}$ in the re-arranged sequence $(\tilde p_j)$ are not asymptotic to one another, which could not happen if $(\tilde p_j)$ were asymptotic to a regularly varying sequence.



10. Pure power laws 

As has been mentioned in the Introduction, processes of coagulation and fragmentation of random masses [6, 33] are related to occupancy schemes where the frequencies $(p_j)$ are random. In many situations [22, 24, 32, 5] one encounters a relation

$$K_n \sim_{a.s.} D\,n^{\alpha} \qquad (n \to \infty) \tag{48}$$

for the number of blocks of the partition induced by sampling from $(p_j)$, where $0 < \alpha < 1$ and $D$ is a strictly positive random variable (in this context, a measure of '$\alpha$-diversity' of the partition [33]). The asymptotics (48) say that, given $D$, $K_n$ is regularly varying with constant slow variation factor.

Conditioning on the frequencies and applying Proposition 2 we always have

$$K_n \sim_{a.s.} \mathbb{E}[K_n \mid (p_j)],$$

hence, by Proposition 17, (48) is equivalent to

$$\#\{j : p_j \ge x\} \sim_{a.s.} \frac{D}{\Gamma(1-\alpha)}\,x^{-\alpha} \qquad (x \to 0), \tag{49}$$

and by Proposition 23 it is also equivalent to

$$p_j \sim_{a.s.} \left(\frac{D}{\Gamma(1-\alpha)}\right)^{1/\alpha} j^{-1/\alpha} \qquad (j \to \infty). \tag{50}$$

Furthermore, any of these implies for $r = 1, 2, \dots$

$$K_{n,r} \sim_{a.s.} \frac{\alpha(1-\alpha)\cdots(r-1-\alpha)}{r!}\,D\,n^{\alpha} \qquad (n \to \infty), \tag{51}$$

and for size-biased frequencies

$$\sum_{j>k} \tilde p_j \sim_{a.s.} \alpha\,D^{1/\alpha}\,k^{1-1/\alpha} \qquad (k \to \infty). \tag{52}$$

The relations (48), (49), (50), (51), (52) appear in the literature under the folk name 'power laws'. Typically the starting point is (49), from which one arrives at the conclusion that $K_n/n^{\alpha}$ and $K_{n,r}/n^{\alpha}$ converge almost surely to multiples of the same random variable. But the other direction is also useful: from the asymptotics of $K_n$, established by either the analysis of moments or some other method, one can draw conclusions about the behaviour of small frequencies.



Remark. Interestingly, the distribution of $K_n$ does not converge to normal, although this is true for $K_n$ conditioned on $(p_j)$. An explanation for this phenomenon is that the randomness of $(p_j)$ dominates the variability due to random sampling. The CLT for $K_n$ does hold for sampling from Poisson-Dirichlet($\theta$) (in which case $K_n \sim_{a.s.} \theta\log n$), and has been also shown for some instances of random $(p_j)$ under more general assumptions of slow variation [H, 19, 23].



11. Strong laws for large parts

We collect here explicit distributional formulas and a few ways to describe the 
multivariate asymptotics of the 'large parts'. 

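For $X_n^{\downarrow} := (X_{n,j}^{\downarrow},\ j = 1, 2, \dots)$ the sequence of $X_{n,j}$'s arranged in nonincreasing order we have

$$\mathbb{P}(X_{n,j}^{\downarrow} = n_j,\ j = 1, \dots, k) = \frac{n!}{\prod_{j=1}^{k} n_j!\ \prod_{r} m_r!}\ \sum_{\substack{j_1, \dots, j_k \\ \text{distinct}}}\ \prod_{i=1}^{k} p_{j_i}^{n_i} \qquad \Big(n_1 \ge \dots \ge n_k > 0,\ n = \sum_{j=1}^{k} n_j\Big),$$

where $m_r := \#\{i : n_i = r\}$ and the sum expands over all $k$-tuples of distinct positive integers $j_1, \dots, j_k$. Let $\tilde X_n := (\tilde X_{n,j},\ j = 1, 2, \dots)$ be the sequence which starts with the positive terms of $X_n^{\downarrow}$ arranged in the order as the boxes are discovered by the balls, and ends with infinitely many zeroes. Similarly to the above,

$$\mathbb{P}(\tilde X_{n,j} = n_j,\ j = 1, \dots, k) = \frac{n!}{\prod_{j=1}^{k} (n_j - 1)!\,(n_j + n_{j+1} + \dots + n_k)}\ \sum_{\substack{j_1, \dots, j_k \\ \text{distinct}}}\ \prod_{i=1}^{k} p_{j_i}^{n_i},$$

where $n_1 > 0, \dots, n_k > 0$, $n = n_1 + \dots + n_k$. Let $(\tilde p_j,\ j = 1, 2, \dots)$ be the size-biased permutation of $(p_j)$. In particular, $\tilde X_{n,1}$ is the number of balls out of the first $n$ which fall in the same box as the first ball, and $\tilde p_1$ is the frequency of this box. The latter distribution conditionally given $(\tilde p_j)$ is

$$\mathbb{P}(\tilde X_{n,j} = n_j,\ j = 1, \dots, k \mid (\tilde p_j)) = \frac{n!}{\prod_{j=1}^{k} (n_j - 1)!\,(n_j + n_{j+1} + \dots + n_k)}\ \prod_{j=1}^{k} \tilde p_j^{\,n_j - 1}\ \prod_{j=1}^{k-1} (1 - \tilde p_1 - \dots - \tilde p_j).$$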
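The unconditional formula for the counts in discovery order can be verified by exhaustive enumeration on a toy example. The sketch below uses finitely many boxes with rational frequencies, which is an assumption made only to keep the enumeration finite (the paper works with infinitely many positive frequencies, but the formula has the same form). It compares the law obtained from all box sequences of length $n$ with the coefficient $n!/\prod_j (n_j-1)!\,(n_j+\dots+n_k)$ times the sum over distinct boxes. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations, product
from math import factorial


def discovery_order_counts(seq):
    """Occupancy counts listed in the order in which the boxes are first hit."""
    order, counts = [], {}
    for box in seq:
        if box not in counts:
            counts[box] = 0
            order.append(box)
        counts[box] += 1
    return tuple(counts[b] for b in order)


def brute_force(p, n):
    """Exact law of the counts in discovery order, by enumerating all box sequences."""
    law = {}
    for seq in product(range(len(p)), repeat=n):
        prob = Fraction(1)
        for box in seq:
            prob *= p[box]
        key = discovery_order_counts(seq)
        law[key] = law.get(key, Fraction(0)) + prob
    return law


def formula(p, sizes):
    """n! / prod_j (n_j - 1)! (n_j + ... + n_k), times the sum over k-tuples of distinct boxes."""
    k, n = len(sizes), sum(sizes)
    coeff = Fraction(factorial(n))
    for j in range(k):
        coeff /= factorial(sizes[j] - 1) * sum(sizes[j:])
    total = Fraction(0)
    for boxes in permutations(range(len(p)), k):
        term = Fraction(1)
        for b, m in zip(boxes, sizes):
            term *= p[b] ** m
        total += term
    return coeff * total
```

Using exact rational arithmetic avoids any tolerance in the comparison; the enumeration grows like (number of boxes)$^n$, so it is suitable for very small cases only.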

To proceed to the asymptotics, recall that the succession of hits in box $j$ undergoes a Bernoulli process with success probability $p_j$, hence by the classical law of large numbers $n^{-1}X_{n,j} \to_{a.s.} p_j$ as $n \to \infty$. The following multivariate extensions of this result involve various arrangements of the boxes, and feature the behaviour of 'large' parts which grow linearly with $n$. These results are of fundamental importance in Kingman's theory of exchangeable partitions [6, 33].

Proposition 26. As $n \to \infty$ we have

$$n^{-1}X_n \to_{a.s.} (p_1, p_2, \dots), \tag{53}$$

$$n^{-1}X_n^{\downarrow} \to_{a.s.} (p_1, p_2, \dots), \tag{54}$$

$$n^{-1}\tilde X_n \to_{a.s.} (\tilde p_1, \tilde p_2, \dots), \tag{55}$$

where the convergence is understood in the product topology.

Proof. The first relation amounts to the marginal convergence. The second relation follows from $\sum_j n^{-1}X_{n,j} = 1$ and the fact that ranking is a continuous mapping on the infinite-dimensional simplex $\{(x_1, x_2, \dots) : x_j \ge 0,\ \sum_j x_j = 1\}$. The third relation is a known consequence of the second [33]. □

Many questions on random allocations also involve the ordering of the boxes. One can be interested in the last occupied box [13] or the first empty [11] etc. This theme lies outside the scope of this paper, and we only mention one generalization of (53) that appears in the theory of exchangeable ordered partitions.

Let $\triangleleft$ be a strict total order on positive integers, thought of as some 'arrangement' of boxes. With each box $j$ we associate an open interval $]a_j, b_j[$ of length $p_j$, with endpoints $a_j = \sum_{\{i:\, i \triangleleft j\}} p_i$ and $b_j = a_j + p_j$. Let $C = [0,1] \setminus (\cup_j\, ]a_j, b_j[)$ be the complementary closed set. For each $n$ define a random finite set $C_n$ comprised of 1 and of all distinct elements of the sequence $(\sum_{\{i:\, i \triangleleft j\}} n^{-1}X_{n,i},\ j = 1, 2, \dots)$.

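Proposition 27. For $d_H$ the Hausdorff distance on the space of closed subsets of $[0,1]$,

$$d_H(C_n, C) \to_{a.s.} 0 \qquad (n \to \infty).$$

Proof. This follows from (54) and the easily established convergence

$$\sum_{\{i:\, i \triangleleft j\}} n^{-1}X_{n,i} \to_{a.s.} a_j \qquad (n \to \infty). \qquad □$$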
For instance, define the order $\triangleleft$ by choosing a sequence of distinct reals $(y_j,\ j = 1, 2, \dots)$ and setting $i \triangleleft j$ iff $y_i < y_j$. Consider the discrete distribution $\mu(dy) = \sum_j p_j\,\delta_{y_j}(dy)$ which places mass $p_j$ at point $y_j$. For an independent sample of $n$ elements from $\mu$, the finite set $C_n$ encodes the nonzero counts of repeated values in the sample, in the natural (increasing) order of the values. The set $C$ can be identified with the quantile transform of $\mu$ [18]. For example, $\triangleleft$ is the standard order on integers for $y_j = j$ $(j = 1, 2, \dots)$, in which case $C = \{0, p_1, p_1 + p_2, p_1 + p_2 + p_3, \dots\}$ and Proposition 27 amounts to (54). A more sophisticated example appears when $\{y_j\}$ is the set of all rational numbers, in which case $C$ is a Cantor set.
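Proposition 27 can be illustrated for the standard order with geometric frequencies. The sketch below rests on assumptions made here: $p_j = (1-q)q^{j-1}$, a truncated version of $C$ consisting of the partial sums $1 - q^j$ together with the limit point 1, a fixed random seed, and a direct quadratic-time computation of the Hausdorff distance between finite sets. The function names are hypothetical.

```python
import random


def hausdorff(a, b):
    """Hausdorff distance between two finite nonempty subsets of the real line."""
    d_ab = max(min(abs(x - y) for y in b) for x in a)
    d_ba = max(min(abs(x - y) for x in a) for y in b)
    return max(d_ab, d_ba)


def limit_set(q, boxes=60):
    """Truncated C for the standard order: partial sums 1 - q**j, together with the point 1."""
    return [1 - q ** j for j in range(boxes)] + [1.0]


def sample_set(n, q, seed=0):
    """C_n: the point 1 and the partial sums of n^{-1} X_{n,i} over boxes i < j."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n):
        j = 1
        while rng.random() < q:       # geometric box index: P(j) = (1-q) q**(j-1)
            j += 1
        counts[j] = counts.get(j, 0) + 1
    points, running = {0.0, 1.0}, 0
    for j in range(1, max(counts) + 1):
        running += counts.get(j, 0)
        points.add(running / n)
    return sorted(points)
```

For large `n` the value `hausdorff(sample_set(n, q), limit_set(q))` should be small, of the order of the fluctuations of an empirical distribution function, that is roughly $n^{-1/2}$.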

Acknowledgement We are indebted to a referee for constructive comments.